Overall, we want to know which factors affect demand for Capital Bikeshare. To explore this, we ask the following SMART questions:
What is the impact of seasonality on bike rental demand?
What are the key factors that influence bike rental demand, and how do they affect the overall bike-sharing system performance?
What are the most popular bike stations and routes, and how do they vary by time of day, season, and day of the week?
How does the weather (temperature, humidity, wind speed) impact bike usage patterns?
How does bike usage vary during holidays compared to regular days?
Three distinct sources were utilized to gather the data. The website links for the respective sources are as follows:
| Variable | Description |
|---|---|
| temperature | The temperature in degrees Fahrenheit. |
| feelsliketemp | The “feels like” temperature in degrees Fahrenheit, which takes into account factors such as humidity and wind. |
| dew | The dew point in degrees Fahrenheit. |
| humidity | The relative humidity as a percentage. |
| windspeed | The wind speed in miles per hour. |
| uvindex | The UV index, which is a measure of the strength of ultraviolet radiation from the sun. |
| weather | A categorical variable describing the weather conditions (sunny, cloudy, rainy, etc.). |
| Variable | Description |
|---|---|
| Date | Holiday day and month |
| holiday | Name of the Holiday |
The dataset used for this analysis was collected from three sources: Capital Bikeshare (CaBi), weather (Visual Crossing), and US holidays (Time and Date). The CaBi data spans October 2010 to March 2023, so we filtered the weather and US holidays data to the same period for consistency. Because the CaBi data covers only Washington, D.C., we filtered the weather data to that location. The holiday data was extracted from a website and transferred to an Excel sheet, and both R and Excel commands were used to remove unnecessary columns.
Once all three data sources were filtered and cleaned, we integrated them into a single CSV file, which was then used for analysis and modeling. Filtering, cleaning, and merging gave us a consistent dataset for studying bike sharing, weather patterns, and holiday trends.
1. Sources of data: We collected data from three sources: Capital Bikeshare (CaBi), weather (Visual Crossing), and US holidays (Time and Date).
2. Time frame: The CaBi data covers October 2010 to March 2023, so we filtered the weather and US holidays data to the same period for consistency.
3. Filtering weather data: Because the CaBi data covers only Washington, D.C., we filtered the weather data to this location so that it is relevant to our analysis.
4. Extracting holiday data: We extracted the holiday table from a website and transferred it to an Excel sheet. We then used both R and Excel commands to remove unnecessary columns.
5. Merging data: Once all three data sources were filtered and cleaned, we integrated them into a single CSV file, which was then used for analysis and modeling.
Overall, this process ensured that we had a reliable and relevant dataset to work with, enabling us to gain insights into various aspects of bike sharing, weather patterns, and holiday trends.
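The merging step described above can be sketched in base R as follows. This is a minimal illustration, not the code used for the report: the toy values, the column name `date` as the join key, and the output file name are all assumptions.

```r
# Toy stand-ins for the three cleaned sources (values are made up).
rides <- data.frame(
  date = as.Date(c("2022-07-03", "2022-07-04", "2022-07-05")),
  noofbikes = c(120L, 95L, 140L)
)
weather <- data.frame(
  date = as.Date(c("2022-07-03", "2022-07-04", "2022-07-05")),
  temperature = c(84.1, 88.3, 90.2)
)
holidays <- data.frame(
  date = as.Date("2022-07-04"),
  holiday = "Independence Day"
)

# Left joins keep every ride row; dates without a holiday get NA.
merged <- merge(rides, weather, by = "date", all.x = TRUE)
merged <- merge(merged, holidays, by = "date", all.x = TRUE)

# Hypothetical output file name, shown for completeness only.
# write.csv(merged, "cabi_merged.csv", row.names = FALSE)
print(merged)
```

Using `all.x = TRUE` means non-holiday dates survive the join with a missing `holiday` value, which is consistent with the NULL values reported for the merged data.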
The final selected variables are :
| Variable | Description |
|---|---|
| started_at | The date and time when the bike rental started. |
| start_station_name | The name of the bike station where the rental started. |
| member_type | The type of member who rented the bike (casual or member). |
| duration | The duration of the bike rental in seconds. |
| noofbikes | The number of bikes rented for this rental. |
| temperature | The temperature in degrees Fahrenheit. |
| feelsliketemp | The “feels like” temperature in degrees Fahrenheit, which takes into account factors such as humidity and wind. |
| dew | The dew point in degrees Fahrenheit. |
| humidity | The relative humidity as a percentage. |
| windspeed | The wind speed in miles per hour. |
| uvindex | The UV index, which is a measure of the strength of ultraviolet radiation from the sun. |
| weather | A categorical variable describing the weather conditions (sunny, cloudy, rainy, etc.). |
| weekday | A categorical variable indicating the weekend or weekday. |
| holiday | A variable indicating whether or not the rental occurred on a holiday. |
| season | A categorical variable indicating the season (spring, summer, fall, or winter). |
| date | The date of the bike rental. |
| month | The month of the bike rental. |
| year | The year of the bike rental. |
There are 603620 NULL values and 0 duplicates in the dataframe.
The variables and their data types before cleaning and formatting:
| Variable | Type |
|---|---|
| started_at | character |
| start_station_name | character |
| member_type | character |
| duration | numeric |
| noofbikes | integer |
| temperature | numeric |
| feelsliketemp | numeric |
| dew | numeric |
| humidity | numeric |
| windspeed | numeric |
| uvindex | integer |
| weather | character |
| weekday | character |
| holiday | character |
| season | character |
| date | integer |
| month | integer |
| year | integer |
The summary of the data set before cleaning and formatting:
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| duration | -1737539 | 625 | 845 | 1399 | 1278 | 7592116 |
| noofbikes | 1.00 | 3.00 | 11.00 | 21.88 | 30.00 | 1163.00 |
| temperature | -9.70 | 26.60 | 50.60 | 48.46 | 71.70 | 92.90 |
| feelsliketemp | -16.40 | 25.50 | 49.20 | 47.43 | 71.60 | 103.20 |
| dew | -19.20 | 17.20 | 37.10 | 36.55 | 59.10 | 76.90 |
| humidity | 19.00 | 53.50 | 64.00 | 63.56 | 73.90 | 98.10 |
| windspeed | 4.30 | 12.30 | 15.80 | 17.37 | 21.20 | 58.50 |
| uvindex | 0.000 | 4.000 | 6.000 | 5.805 | 8.000 | 10.000 |
| date | 1.00 | 8.00 | 16.00 | 15.72 | 23.00 | 31.00 |
| month | 1.00 | 4.00 | 7.00 | 6.58 | 10.00 | 12.00 |
| year | 2010 | 2015 | 2018 | 2018 | 2021 | 2023 |

The remaining columns (started_at, start_station_name, member_type, weather, weekday, holiday, and season) are character vectors of length 1609103.
The started_at column is converted from a date-time string to a date in Y-m-d format.
Blank values in start_station_name are replaced with NA.
The member_type column is cleaned by converting the inconsistently capitalized casual and member categories to lowercase. Rows with the value “Unknown” in member_type are counted and then removed from the data frame, since they cannot be classified. Finally, member_type is converted to a factor for efficient storage and analysis.
The duration column is converted to numeric and rounded to two decimal places, and rows with negative values are removed.
The weather categories in the CaBi data frame are standardized by grouping similar weather conditions together, and the column is converted to a factor for efficient storage and analysis.
| Old Value | New Value |
|---|---|
| Partially cloudy | Cloudy |
| Rain, Overcast | OvercastRain |
| Rain, Partially cloudy | Rain |
| Snow, Rain, Overcast | Overcast |
| Snow, Rain, Partially cloudy | Cloudy |
| Snow, Partially cloudy | Snow |
| Snow, Overcast | OvercastSnow |
| Snow, Rain | Rain |
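One way to apply the mapping in the table above is a named lookup vector. The sketch below is illustrative only: the helper `recode_weather` and the sample input are hypothetical, and labels that do not appear in the table (such as "Clear") are assumed to pass through unchanged.

```r
# Lookup copied from the mapping table; names are the old values.
weather_map <- c(
  "Partially cloudy"             = "Cloudy",
  "Rain, Overcast"               = "OvercastRain",
  "Rain, Partially cloudy"       = "Rain",
  "Snow, Rain, Overcast"         = "Overcast",
  "Snow, Rain, Partially cloudy" = "Cloudy",
  "Snow, Partially cloudy"       = "Snow",
  "Snow, Overcast"               = "OvercastSnow",
  "Snow, Rain"                   = "Rain"
)

# Hypothetical helper: recode mapped labels, keep the rest, return a factor.
recode_weather <- function(x) {
  hit <- x %in% names(weather_map)
  x[hit] <- unname(weather_map[x[hit]])
  factor(x)
}

raw <- c("Clear", "Rain, Overcast", "Partially cloudy", "Snow, Rain")
recoded <- recode_weather(raw)
print(table(recoded))
```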
The weekday column is converted to a factor for efficient storage and analysis.
The holiday variable contains holiday names; these are replaced with the value “holiday”, and NULL values are replaced with “not holiday”.
The season column is converted to a factor for efficient storage and analysis.
The date column is removed from the CaBi data frame, as it is no longer needed for the analysis.
The month column is converted to a factor, and its levels are changed to month names for easier interpretation.
The year column is converted to numeric.
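Several of the cleaning steps above (member_type, duration, holiday, and month) can be sketched in base R. The toy rows are made up, and the exact calls used for the report are not shown in the source, so treat this as one plausible implementation.

```r
# Toy rows standing in for the raw merged data (values are made up).
CaBi <- data.frame(
  member_type = c("Member", "casual", "Unknown", "CASUAL", "member"),
  duration    = c(845.456, -30, 600, 1278.1, 625),
  holiday     = c(NA, "Labor Day", NA, NA, "Christmas Day"),
  month       = c(1L, 9L, 3L, 7L, 12L)
)

# member_type: lower-case, drop "unknown", then store as a factor.
CaBi$member_type <- tolower(CaBi$member_type)
CaBi <- CaBi[CaBi$member_type != "unknown", ]
CaBi$member_type <- factor(CaBi$member_type)

# duration: numeric, two decimals, no negative values.
CaBi$duration <- round(as.numeric(CaBi$duration), 2)
CaBi <- CaBi[CaBi$duration >= 0, ]

# holiday: any holiday name becomes "holiday", missing becomes "not holiday".
CaBi$holiday <- factor(ifelse(is.na(CaBi$holiday), "not holiday", "holiday"))

# month: integer 1..12 to a factor labelled with month names.
CaBi$month <- factor(month.name[CaBi$month], levels = month.name)

print(CaBi)
```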
The summary of the CaBi data frame after cleaning and formatting:
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| started_at | 2010-09-20 | 2015-12-25 | 2018-10-15 | 2018-05-28 16:01:43.94 | 2021-03-06 | 2023-03-31 |
| duration | 1 | 625 | 845 | 1473 | 1278 | 7592116 |
| noofbikes | 1.00 | 3.00 | 11.00 | 21.88 | 30.00 | 1163.00 |
| temperature | -9.70 | 26.60 | 50.60 | 48.46 | 71.70 | 92.90 |
| feelsliketemp | -16.40 | 25.50 | 49.20 | 47.43 | 71.60 | 103.20 |
| dew | -19.20 | 17.20 | 37.10 | 36.55 | 59.10 | 76.90 |
| humidity | 19.00 | 53.50 | 64.00 | 63.56 | 73.90 | 98.10 |
| windspeed | 4.30 | 12.30 | 15.80 | 17.37 | 21.20 | 58.50 |
| uvindex | 0.000 | 4.000 | 6.000 | 5.806 | 8.000 | 10.000 |
| year | 2010 | 2015 | 2018 | 2018 | 2021 | 2023 |

| Variable | Level | Count |
|---|---|---|
| member_type | casual | 344358 |
| member_type | member | 1264450 |
| weather | Clear | 70997 |
| weather | Cloudy | 864081 |
| weather | Overcast | 65948 |
| weather | OvercastRain | 168746 |
| weather | OvercastSnow | 293 |
| weather | Rain | 432429 |
| weather | Snow | 6314 |
| weekday | Weekday | 1158685 |
| weekday | Weekend | 450123 |
| holiday | holiday | 1005209 |
| holiday | not holiday | 603599 |
| season | Fall | 415069 |
| season | Spring | 400827 |
| season | Summer | 415276 |
| season | Winter | 377636 |
| month | October | 143030 |
| month | August | 141278 |
| month | March | 140401 |
| month | July | 139873 |
| month | September | 138118 |
| month | May | 134607 |
| month | (Other) | 771501 |

start_station_name remains a character vector of length 1608808.
The output plot shows that the highest percentage of bike rentals
occurred during the summer season, followed by fall, spring, and winter.
The chart provides a clear visualization of the differences in bike
rental demand across different seasons.
The below scatter plot shows that bike rental demand is influenced by both temperature and season. It indicates that the demand is highest in the spring and summer, with a peak in May and June, and less in the winter.
The below plot shows that bike rental demand is higher on cloudy and clear days, followed by rainy days, and lower on days with overcast snow.
The below scatter plot displays the distribution of bike rentals by weather condition, with each point representing the number of bikes rented during a specific weather condition. The plot indicates that the highest demand for bike rentals occurs on cloudy and clear days, followed by rainy and overcast days.
The resulting heatmap shows that there are some strong positive correlations between certain variables, such as temperature and feelsliketemp, and weaker correlations between other variables. For example, there is a negative correlation between humidity and windspeed.
| | noofbikes | temperature | feelsliketemp | dew | humidity | windspeed | uvindex |
|---|---|---|---|---|---|---|---|
| noofbikes | 1.00 | 0.19 | 0.19 | 0.17 | -0.02 | -0.10 | 0.19 |
| temperature | 0.19 | 1.00 | 1.00 | 0.97 | 0.21 | -0.53 | 0.38 |
| feelsliketemp | 0.19 | 1.00 | 1.00 | 0.98 | 0.23 | -0.52 | 0.39 |
| dew | 0.17 | 0.97 | 0.98 | 1.00 | 0.43 | -0.54 | 0.27 |
| humidity | -0.02 | 0.21 | 0.23 | 0.43 | 1.00 | -0.25 | -0.39 |
| windspeed | -0.10 | -0.53 | -0.52 | -0.54 | -0.25 | 1.00 | -0.07 |
| uvindex | 0.19 | 0.38 | 0.39 | 0.27 | -0.39 | -0.07 | 1.00 |
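A correlation matrix like the one above can be produced with `cor()` on the numeric columns. The sketch below uses simulated data (not the CaBi data), built so that `feelsliketemp` tracks `temperature` and `windspeed` falls as temperature rises; the coefficients are arbitrary assumptions.

```r
set.seed(1)
n <- 200
temperature <- rnorm(n, mean = 55, sd = 18)
toy <- data.frame(
  temperature   = temperature,
  feelsliketemp = temperature + rnorm(n, sd = 2),          # built to track temperature
  windspeed     = 25 - 0.15 * temperature + rnorm(n, sd = 3)
)

# Pearson correlations, rounded to two decimals as in the table.
corr <- round(cor(toy, use = "complete.obs"), 2)
print(corr)
```

The rounded matrix can then be passed to a plotting function to draw the heatmap.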
The plot shows the total number of bikes rented in each year grouped by the weekday. The plot suggests that there is a higher demand for bikes on weekdays compared to weekends, and the demand for bikes has been increasing over the years.
The next plot aggregates the total number of bikes rented by holiday status and shows, as a bar plot, the percentage of total bikes rented in each holiday category (Yes or No), along with the number of bikes rented in each category.
Except for the colder months (January, February, November, and December), all months have a relatively high average number of rentals.
The pie chart summarizes the total number of bikes rented by month and shows each month's share of the total, with a different color for each month.
The plot is helpful in identifying the bike stations with the highest and lowest demand for bikes over the years, and it can be useful for bike-sharing companies to make strategic decisions. Companies can focus more on bike stations with high demand, while reducing resources allocated to bike stations with low demand.
The plot shows that bike demand on holidays is higher than on non-holidays in most years. In 2020, however, there was a sharp drop in bike demand for both holidays and non-holidays, which could be attributed to the COVID-19 pandemic. The noofbikes count for 2023 is low because the data covers only its first three months.
Instead of sampling from the large data set, we re-aggregated noofbikes by started_at alone (it was previously aggregated by both started_at and start_station_name), which gives one row per date. We then dropped the variables that are not required for modeling (started_at, start_station_name, member_type, duration, month, and year).
The total number of rows falls from 1608808 to 4572.
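The re-aggregation described above can be sketched with base R's `aggregate()`. The toy rows and station names are made up, and the sketch assumes that the weather columns are identical for all rows of a given date, so they can be used as grouping keys.

```r
# Toy station-level rows (values are made up).
by_station <- data.frame(
  started_at = as.Date(c("2022-07-03", "2022-07-03", "2022-07-04")),
  start_station_name = c("Station A", "Station B", "Station A"),
  noofbikes = c(12L, 8L, 15L),
  temperature = c(84.1, 84.1, 88.3)
)

# Sum rentals per date; temperature is constant within a date, so it is kept as a key.
daily <- aggregate(noofbikes ~ started_at + temperature, data = by_station, FUN = sum)

# Drop the column that is not used for modeling.
daily$started_at <- NULL
print(daily)
```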
Summary of the data used in the rest of the analysis:
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| noofbikes | 21 | 4642 | 7533 | 7699 | 10847 | 19531 |
| temperature | -9.70 | 35.70 | 53.40 | 51.23 | 72.40 | 92.90 |
| feelsliketemp | -16.40 | 30.30 | 52.40 | 49.96 | 72.40 | 103.20 |
| dew | -19.20 | 20.77 | 40.25 | 38.75 | 59.52 | 76.90 |
| humidity | 19.00 | 53.00 | 63.70 | 63.41 | 73.90 | 98.10 |
| windspeed | 4.30 | 12.20 | 15.30 | 16.68 | 20.12 | 58.50 |
| uvindex | 0.000 | 4.000 | 6.000 | 5.895 | 8.000 | 10.000 |

| Variable | Level | Count |
|---|---|---|
| weather | Clear | 200 |
| weather | Cloudy | 2379 |
| weather | Overcast | 224 |
| weather | OvercastRain | 533 |
| weather | OvercastSnow | 3 |
| weather | Rain | 1210 |
| weather | Snow | 23 |
| weekday | Weekday | 3268 |
| weekday | Weekend | 1304 |
| holiday | holiday | 2861 |
| holiday | not holiday | 1711 |
| season | Fall | 1164 |
| season | Spring | 1135 |
| season | Summer | 1104 |
| season | Winter | 1169 |
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| season | 3 | 17723779732 | 5907926577 | 573.7721 | 0 |
| Residuals | 4568 | 47035062895 | 10296642 | NA | NA |
The p-value is less than 0.05 (p < 0.05), which indicates that there is a significant difference in the mean number of bikes rented across different seasons. Therefore, we can reject the null hypothesis that there is no significant difference in the mean number of bikes rented across different seasons. The F-value of 573.8 is also quite large, which further supports the conclusion that the mean number of bikes rented across different seasons is significantly different.
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| holiday | 1 | 8900068 | 8900068 | 0.6281598 | 0.4280723 |
| Residuals | 4570 | 64749942560 | 14168478 | NA | NA |
The p-value for holiday is 0.428, which is greater than 0.05, the typical significance level used in statistical analysis. This means that we fail to reject the null hypothesis that there is no significant difference in the mean number of bikes rented on holidays and non-holidays.
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| weekday | 1 | 2586542 | 2586542 | 0.1825383 | 0.669221 |
| Residuals | 4570 | 64756256086 | 14169859 | NA | NA |
The weekday variable has a p-value of 0.669, which is also greater than 0.05. This suggests that there is not enough evidence to reject the null hypothesis that there are no differences in the mean number of bikes rented between groups.
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| weather | 6 | 5372980354 | 895496726 | 68.83697 | 0 |
| Residuals | 4565 | 59385862274 | 13008951 | NA | NA |
The F-value is 68.84, and the p-value is less than 2e-16, which is much smaller than the significance level of 0.05. This suggests that we can reject the null hypothesis and conclude that there is a significant difference in the mean noofbikes across different levels of weather.
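The one-way ANOVA tables above have the layout produced by `aov()`. The sketch below runs the same kind of test on simulated data; the group means and standard deviation are made-up assumptions, chosen only so that the seasons differ. It also recomputes the F value as the ratio of the two mean squares, which is how the season table's 5907926577 / 10296642 ≈ 573.77 arises.

```r
set.seed(42)
season <- factor(rep(c("Fall", "Spring", "Summer", "Winter"), each = 50))
# Made-up seasonal means, chosen only so the groups differ.
means <- c(Fall = 8000, Spring = 7500, Summer = 9500, Winter = 5000)
toy <- data.frame(
  season = season,
  noofbikes = rnorm(200, mean = means[as.character(season)], sd = 1500)
)

fit <- aov(noofbikes ~ season, data = toy)
tab <- summary(fit)[[1]]
print(tab)

# F = (between-group mean square) / (residual mean square)
f_manual <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2]
```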
##
## Call:
## lm(formula = noofbikes ~ temperature, data = CaBi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9036.6 -2774.4 -23.8 2674.9 13176.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4958.313 126.853 39.09 <2e-16 ***
## temperature 53.497 2.254 23.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3552 on 4570 degrees of freedom
## Multiple R-squared: 0.1098, Adjusted R-squared: 0.1096
## F-statistic: 563.5 on 1 and 4570 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = noofbikes ~ feelsliketemp, data = CaBi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9214.6 -2687.0 -7.8 2564.5 13181.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4919.579 115.113 42.74 <2e-16 ***
## feelsliketemp 55.640 2.059 27.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3495 on 4570 degrees of freedom
## Multiple R-squared: 0.1378, Adjusted R-squared: 0.1376
## F-statistic: 730.4 on 1 and 4570 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = noofbikes ~ dew, data = CaBi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9115.8 -2772.0 -18.3 2758.2 13144.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5700.21 103.63 55.01 <2e-16 ***
## dew 51.58 2.30 22.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3573 on 4570 degrees of freedom
## Multiple R-squared: 0.09912, Adjusted R-squared: 0.09893
## F-statistic: 502.8 on 1 and 4570 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = noofbikes ~ humidity, data = CaBi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7650.8 -3042.3 -199.1 3147.9 11663.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8378.319 254.416 32.932 < 2e-16 ***
## humidity -10.710 3.915 -2.736 0.00625 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3761 on 4570 degrees of freedom
## Multiple R-squared: 0.001635, Adjusted R-squared: 0.001416
## F-statistic: 7.483 on 1 and 4570 DF, p-value: 0.006252
##
## Call:
## lm(formula = noofbikes ~ windspeed, data = CaBi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7853.6 -3047.6 -145.1 3108.2 11724.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8654.757 149.343 57.95 < 2e-16 ***
## windspeed -57.301 8.317 -6.89 6.35e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3745 on 4570 degrees of freedom
## Multiple R-squared: 0.01028, Adjusted R-squared: 0.01006
## F-statistic: 47.47 on 1 and 4570 DF, p-value: 6.349e-12
##
## Call:
## lm(formula = noofbikes ~ uvindex, data = CaBi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8962.9 -2494.0 93.2 2376.8 12489.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3566.54 123.18 28.95 <2e-16 ***
## uvindex 701.04 19.17 36.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3311 on 4570 degrees of freedom
## Multiple R-squared: 0.2263, Adjusted R-squared: 0.2261
## F-statistic: 1337 on 1 and 4570 DF, p-value: < 2.2e-16
| Predictor Variable | Intercept | Estimate | Adjusted R-squared | p-value | F-statistic |
|---|---|---|---|---|---|
| temperature | 4958.3 | 53.5 | 0.1096 | < 2.2e-16 | 563.5 |
| feelsliketemp | 4919.6 | 55.6 | 0.1376 | < 2.2e-16 | 730.4 |
| dew | 5700.2 | 51.6 | 0.0989 | < 2.2e-16 | 502.8 |
| humidity | 8378.3 | -10.7 | 0.0014 | 0.00625 | 7.483 |
| windspeed | 8654.8 | -57.3 | 0.0101 | 6.35e-12 | 47.47 |
| uvindex | 3566.5 | 701.0 | 0.2261 | < 2.2e-16 | 1337 |
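A summary table like the one above can be assembled by fitting one simple regression per predictor and collecting the pieces of `summary()`. The sketch below uses simulated data with arbitrary coefficients; the helper name `summarise_fit` is hypothetical.

```r
set.seed(7)
n <- 300
toy <- data.frame(temperature = rnorm(n, 55, 18), windspeed = rnorm(n, 17, 6))
toy$noofbikes <- 5000 + 50 * toy$temperature - 150 * toy$windspeed + rnorm(n, sd = 3000)

# Hypothetical helper: fit noofbikes ~ <predictor> and return one table row.
summarise_fit <- function(predictor, data) {
  fit <- lm(reformulate(predictor, response = "noofbikes"), data = data)
  s <- summary(fit)
  data.frame(
    predictor = predictor,
    intercept = unname(coef(fit)[1]),
    estimate  = unname(coef(fit)[2]),
    adj_r2    = s$adj.r.squared,
    f_stat    = unname(s$fstatistic["value"]),
    p_value   = s$coefficients[2, "Pr(>|t|)"]
  )
}

results <- do.call(rbind, lapply(c("temperature", "windspeed"), summarise_fit, data = toy))
print(results)
```

Building the table from the fitted objects, rather than typing values by hand, keeps it in step with the model output. For a single-predictor model the F-statistic equals the squared t value of the slope, which gives a quick consistency check.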
The training data set CaBitrain has 3660 observations, and the test data set CaBitest has 912 observations.
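A random 80/20 split can be sketched in base R as shown below. The seed and the use of `sample()` are assumptions; the report does not say which function produced its partition. With simple rounding this sketch yields 3658 and 914 rows rather than the reported 3660 and 912, which suggests a different (possibly stratified) partitioning function was used.

```r
set.seed(123)                             # assumed seed; the report does not state one
CaBi <- data.frame(id = seq_len(4572))    # placeholder with the report's row count

train_idx <- sample(nrow(CaBi), size = round(0.8 * nrow(CaBi)))
CaBitrain <- CaBi[train_idx, , drop = FALSE]
CaBitest  <- CaBi[-train_idx, , drop = FALSE]

print(c(train = nrow(CaBitrain), test = nrow(CaBitest)))
```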
## Start: AIC=58835.95
## noofbikes ~ temperature + dew + humidity + windspeed + uvindex
##
## Df Sum of Sq RSS AIC
## <none> 3.4955e+10 58836
## - windspeed 1 49588986 3.5005e+10 58839
## - humidity 1 2242583738 3.7198e+10 59062
## - temperature 1 3125373823 3.8081e+10 59147
## - dew 1 3359115712 3.8314e+10 59170
## - uvindex 1 4826876083 3.9782e+10 59307
##
## Call:
## lm(formula = noofbikes ~ temperature + dew + humidity + windspeed +
## uvindex, data = CaBitrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9269.2 -2169.1 52.2 2220.3 10792.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22088.690 1239.763 17.817 <2e-16 ***
## temperature -518.551 28.689 -18.075 <2e-16 ***
## dew 597.028 31.861 18.739 <2e-16 ***
## humidity -223.698 14.610 -15.311 <2e-16 ***
## windspeed -20.735 9.107 -2.277 0.0229 *
## uvindex 604.906 26.929 22.463 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3093 on 3654 degrees of freedom
## Multiple R-squared: 0.3202, Adjusted R-squared: 0.3193
## F-statistic: 344.2 on 5 and 3654 DF, p-value: < 2.2e-16
Stepwise selection starts from the full model with an AIC of 58835.95. Removing any single predictor raises the AIC (the smallest increase, from dropping windspeed, gives 58839), so the procedure keeps all five predictors: temperature, dew, humidity, windspeed, and uvindex.
The R-squared value of the model is 0.3202, indicating that 32.02% of the variance in noofbikes is explained by the predictor variables in the model. The adjusted R-squared value is 0.3193, which adjusts for the number of predictor variables in the model.
The temperature coefficient is negative here although it is positive in the simple regression. Temperature and dew are strongly correlated (0.97 in the correlation matrix), so their individual coefficients should be interpreted with caution.
Overall, the results suggest that the combination of temperature, dew, humidity, windspeed, and uvindex explains about a third of the variation in the number of bikes rented in CaBitrain.
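The stepwise output above has the form printed by `step()`. The sketch below shows backward selection by AIC on simulated data. The variable `noise_var` is a made-up predictor unrelated to the response, and the coefficients are arbitrary; whether `noise_var` is dropped depends on the random draw.

```r
set.seed(11)
n <- 400
toy <- data.frame(
  temperature = rnorm(n, 55, 18),
  uvindex     = runif(n, 0, 10),
  noise_var   = rnorm(n)            # unrelated to the response by construction
)
toy$noofbikes <- 4000 + 40 * toy$temperature + 500 * toy$uvindex + rnorm(n, sd = 2500)

full <- lm(noofbikes ~ temperature + uvindex + noise_var, data = toy)

# Backward elimination: drop a term only if doing so lowers the AIC.
selected <- step(full, direction = "backward", trace = 0)
print(formula(selected))

# AIC after removing each term in turn, as in the table above.
print(drop1(full))
```

A term stays in the model whenever removing it would raise the AIC, which is why the procedure can stop at the starting model without removing anything.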
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9441.0 -1930.9 137.3 2095.3 10588.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15817.360 1266.647 12.488 < 2e-16 ***
## temperature -307.299 29.228 -10.514 < 2e-16 ***
## dew 339.211 32.946 10.296 < 2e-16 ***
## humidity -105.848 15.250 -6.941 4.59e-12 ***
## windspeed -18.276 9.137 -2.000 0.04554 *
## uvindex 419.655 28.108 14.930 < 2e-16 ***
## weatherCloudy 265.716 243.519 1.091 0.27528
## weatherOvercast -1056.564 335.119 -3.153 0.00163 **
## weatherOvercastRain -1685.304 311.890 -5.404 6.95e-08 ***
## weatherOvercastSnow -2679.742 1702.874 -1.574 0.11565
## weatherRain -539.624 268.745 -2.008 0.04472 *
## weatherSnow -1219.139 770.361 -1.583 0.11361
## weekdayWeekend -56.367 106.984 -0.527 0.59832
## `holidaynot holiday` -54.526 100.847 -0.541 0.58876
## seasonSpring -623.145 146.181 -4.263 2.07e-05 ***
## seasonSummer 616.347 153.350 4.019 5.96e-05 ***
## seasonWinter -2687.742 154.018 -17.451 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2913 on 3643 degrees of freedom
## Multiple R-squared: 0.3988, Adjusted R-squared: 0.3961
## F-statistic: 151 on 16 and 3643 DF, p-value: < 2.2e-16
The first model, which includes the five numeric independent variables (temperature, dew, humidity, windspeed, and uvindex), has an adjusted R-squared value of 0.3193, indicating that these variables explain 31.93% of the variability in noofbikes. The F-statistic is significant (p-value < 2.2e-16), indicating that the model fits better than an intercept-only model.
The second model includes all of the independent variables from the first model, as well as the categorical variables (weather, weekday, holiday, and season). This model has an adjusted R-squared value of 0.3961, indicating that these variables explain 39.61% of the variability in noofbikes. The F-statistic is also significant (p-value < 2.2e-16).
In both models, temperature, dew, humidity, windspeed, and uvindex are all significant predictors of noofbikes. In the second model, several levels of the categorical variables are also significant predictors (weatherOvercast, weatherOvercastRain, weatherRain, seasonSpring, seasonSummer, and seasonWinter). The remaining levels (weatherCloudy, weatherOvercastSnow, weatherSnow, weekdayWeekend, and holidaynot holiday) are not significant at the 0.05 level.
The resampling was performed using a 5-fold cross-validation with 3 repetitions, resulting in a total of 15 iterations. The summary of sample sizes indicates that each fold had approximately the same number of samples.
The results show that the optimal value for maxdepth is 8, which gives an RMSE of 2959.329, an R-squared of 0.3784168, and an MAE of 2397.165.
The most important feature is temperature with a value
of 100, followed by dew with a value of 65.81,
uvindex with a value of 38.52, and so on. The least
important feature in this model appears to be windspeed
with a value of 7.44.
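The resampling scheme above (5 folds, 3 repetitions, 15 resamples) can be sketched in base R. This only builds the held-out index sets; it is not the tuning code used for the report, and the seed and resample names are assumptions.

```r
set.seed(2023)
n <- 3660            # training rows reported above
k <- 5
repeats <- 3

resamples <- list()
for (r in seq_len(repeats)) {
  # Shuffle fold labels 1..k so every row lands in exactly one fold per repeat.
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    resamples[[sprintf("Fold%d.Rep%d", f, r)]] <- which(fold_id == f)  # held-out rows
  }
}

length(resamples)            # 15
range(lengths(resamples))    # 732 732
```

For each resample, the model is fit on the other four folds and scored on the held-out fold, and the reported RMSE, R-squared, and MAE are averages over the 15 resamples.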
| R2 | RMSE | MAE |
|---|---|---|
| 0.3665989 | 3044.444 | 2440.491 |
The tuned decision tree explains 36.66% of the variability in the target variable. The root mean squared error (RMSE) is 3044.444, and the mean absolute error (MAE) is 2440.491.
Because the R2 of the tuned decision tree is low, we try a bagged decision tree to see whether it increases R2.
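The three metrics reported in these tables can be computed as follows. The helper `regression_metrics` is hypothetical, and R2 is taken here as the squared correlation between actual and predicted values, which is one common definition; the report does not state which definition it used.

```r
# Hypothetical helper returning the three metrics used in the report's tables.
regression_metrics <- function(actual, predicted) {
  c(
    R2   = cor(actual, predicted)^2,               # squared correlation
    RMSE = sqrt(mean((actual - predicted)^2)),     # root mean squared error
    MAE  = mean(abs(actual - predicted))           # mean absolute error
  )
}

actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 320, 380)
m <- regression_metrics(actual, predicted)
print(m)
```

RMSE penalizes large errors more heavily than MAE, so RMSE is never smaller than MAE for the same predictions.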
Temperature has the highest importance score of 100, followed by dew, UV index, humidity, and season. Other variables such as windspeed, weather, and weekday/weekend have relatively lower importance scores, indicating that they have a lesser impact on the target variable.
| R2 | RMSE | MAE |
|---|---|---|
| 0.4031066 | 2957.301 | 2388.401 |
For the bagged decision tree, the R2 score is 0.4031066, which means that the predictor variables explain around 40% of the variance in the target variable. The RMSE is 2957.301, which means that a typical prediction is about 2957 bikes away from the actual value.
| R2 | RMSE | MAE |
|---|---|---|
| 0.5086464 | 2682.561 | 2117.371 |
For the random forest, the R2 score is 0.5086464, which means that the predictor variables explain around 51% of the variance in the target variable. The RMSE is 2682.561, which means that a typical prediction is about 2683 bikes away from the actual value.
With a random 80/20 split of the data, the best of the models above reaches an R2 of only about 0.51. We therefore remove the COVID years from the data set.
The training data set uses the years 2010 to 2018, and the test data set uses the years 2022 and 2023. The years 2019, 2020, and 2021 are excluded because they are the COVID years.
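A year-based split like the one described above can be sketched as follows. The toy data frame has one made-up row per year; the checks at the end are an added safeguard, not something the report describes, confirming that no COVID-year row is used and that no row appears in both sets.

```r
# Toy rows, one per year, standing in for the aggregated daily data.
toy <- data.frame(year = 2010:2023, noofbikes = seq(5000, by = 250, length.out = 14))

covid_years <- 2019:2021
train <- toy[toy$year <= 2018, ]
test  <- toy[toy$year >= 2022, ]

# Sanity checks: no COVID rows, and no row shared between the two sets.
stopifnot(!any(c(train$year, test$year) %in% covid_years))
stopifnot(length(intersect(rownames(train), rownames(test))) == 0)

print(c(train = nrow(train), test = nrow(test)))
```

Checking for overlap matters because test rows that also appear in the training set inflate the test metrics.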
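The training data set CaBitraincovid has 3476 observations, and the test data set CaBitestcovid has 455 observations. Note that 3476 equals the 4572 daily rows minus the 1096 days of 2019 to 2021, so CaBitraincovid appears to include the 2022 and 2023 test rows; if so, the test metrics for this split are optimistic.
The resampling was performed using 5-fold cross-validation with 3 repetitions, resulting in a total of 15 iterations. The summary of sample sizes indicates that each fold had approximately the same number of samples.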
The results show that the optimal value for maxdepth is 8, which gives an RMSE of 2959.329, an R-squared of 0.3784168, and an MAE of 2397.165.
The most important feature is temperature with a value
of 100, followed by dew with a value of 65.81,
uvindex with a value of 38.52, and so on. The least
important feature in this model appears to be windspeed
with a value of 7.44.
| R2 | RMSE | MAE |
|---|---|---|
| 0.5849981 | 2442.709 | 1957.982 |
The tuned decision tree explains 58.5% of the variability in the target variable. The root mean squared error (RMSE) is 2442.709, and the mean absolute error (MAE) is 1957.982.
To improve on the R2 of the tuned decision tree, we again try a bagged decision tree.
For the bagged decision tree, temperature, dew, and seasonSummer are the three most important variables for predicting the outcome variable. Importance decreases down the list, and variables with importance measures close to zero are unlikely to contribute much to the model’s predictive power.
| R2 | RMSE | MAE |
|---|---|---|
| 0.6583164 | 2206.154 | 1762.025 |
For the bagged decision tree, the R2 value has increased to 0.6583164, which indicates a better fit of the model to the data. The RMSE has decreased to 2206.154, and the MAE has decreased to 1762.025. These values suggest that the model predicts the number of bike rentals from the input features better than the tuned decision tree.
| R2 | RMSE | MAE |
|---|---|---|
| 0.9633303 | 759.8413 | 581.2308 |
The Random Forest model has a higher R2 value and lower RMSE and MAE values compared to the regression tree. This suggests that the Random Forest model will be a better fit for the data and have better predictive performance.
ANOVA: season and weather are important factors in predicting the number of bikes rented, while holiday and weekday do not have a significant effect on the number of bikes rented.
Linear Regression: temperature, dew, humidity, windspeed, and uvindex all have a significant effect on the number of bikes rented, as do season and several weather categories; weekday and holiday do not.
Decision Tree: temperature is the most important variable, with an importance score of 100. dew and uvindex are the next most important variables, followed by season (Summer, Spring, Winter), humidity, weather (OvercastRain, Cloudy), and windspeed, in decreasing order of importance.
Random Forest (random split): the model explains 50.86% of the variance in the response variable. The most important variable is temperature, followed by season, humidity, dew, uvindex, and weather.
Decision Tree: the most important feature is temperature with a value of 100, followed by dew with a value of 65.81, uvindex with a value of 38.52, and so on. The least important feature in this model appears to be windspeed with a value of 7.44.
Random Forest (non-COVID years split): the model explains 96.33% of the variance in the response variable. The most important variable is temperature, followed by season, humidity, dew, uvindex, and weather.
| Model Type | Sample Split R2 - R-squared | Sample Split Root Mean Square Error | Sample Split Mean Absolute Error | Non-COVID years Split R2 - R-squared | Non-COVID years Split Root Mean Square Error | Non-COVID years Split Mean Absolute Error |
|---|---|---|---|---|---|---|
| Tuned Decision Tree | 0.3665989 | 3044.444 | 2440.491 | 0.5849981 | 2442.709 | 1957.982 |
| Bagged Decision Tree | 0.4031066 | 2957.301 | 2388.401 | 0.6583164 | 2206.154 | 1762.025 |
| Random Forest | 0.5086464 | 2682.561 | 2117.371 | 0.9633303 | 759.8413 | 581.2308 |
The Random Forest model performs best of the three models on both the sample split and the non-COVID years split: it has the highest R-squared (0.51 and 0.96, respectively) and the lowest RMSE and MAE in each case. The Bagged Decision Tree model ranks second, with R-squared values of 0.40 and 0.66. The Tuned Decision Tree model has the lowest R-squared values (0.37 and 0.58) and the highest RMSE and MAE of the three.
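The ranking can be checked directly from the table. The Python sketch below stores the reported metrics and picks the best model per split by each criterion. The dictionary layout, the split keys, and the helper names `best_model` and `rmse_reduction` are illustrative choices, not part of the original R analysis.

```python
# Metrics copied from the comparison table: (R2, RMSE, MAE) per split.
RESULTS = {
    "Tuned Decision Tree": {
        "sample": (0.3665989, 3044.444, 2440.491),
        "non_covid": (0.5849981, 2442.709, 1957.982),
    },
    "Bagged Decision Tree": {
        "sample": (0.4031066, 2957.301, 2388.401),
        "non_covid": (0.6583164, 2206.154, 1762.025),
    },
    "Random Forest": {
        "sample": (0.5086464, 2682.561, 2117.371),
        "non_covid": (0.9633303, 759.8413, 581.2308),
    },
}


def best_model(results, split):
    """Return the winners by highest R2, lowest RMSE, and lowest MAE."""
    by_r2 = max(results, key=lambda m: results[m][split][0])
    by_rmse = min(results, key=lambda m: results[m][split][1])
    by_mae = min(results, key=lambda m: results[m][split][2])
    return by_r2, by_rmse, by_mae


def rmse_reduction(results, split, model, baseline):
    """Fractional RMSE reduction of `model` relative to `baseline`."""
    return 1.0 - results[model][split][1] / results[baseline][split][1]


for split in ("sample", "non_covid"):
    print(split, best_model(RESULTS, split))
```

Expressing the gap as a relative RMSE reduction, for example with `rmse_reduction(RESULTS, "non_covid", "Random Forest", "Tuned Decision Tree")`, makes the two splits easier to compare than raw error values alone.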
The COVID-19 pandemic was an unexpected circumstance that impacted bike trip demand in ways that may not have been accounted for in the models developed in this project. As a result, the models’ predictions for bike trip demand during the pandemic may be less accurate than their predictions for non-COVID years.
This highlights the importance of considering how unforeseen circumstances can affect the accuracy of predictive models. It also underscores the need to update and retrain models regularly as new data becomes available, so that they reflect changes in the underlying data.
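One way to make this concern concrete is to keep pandemic-era records separate, so a model can be trained and evaluated on each period independently. The Python sketch below partitions daily records by year. The report does not state which years it treated as COVID years, so the set `ASSUMED_COVID_YEARS` is an assumption, and the record layout, function name, and rental counts are hypothetical.

```python
from datetime import date

# Assumption: the report does not say which years were excluded.
ASSUMED_COVID_YEARS = frozenset({2020, 2021})


def split_by_covid(records, covid_years=ASSUMED_COVID_YEARS):
    """Partition records into (non_covid, covid) lists by calendar year."""
    non_covid = [r for r in records if r["date"].year not in covid_years]
    covid = [r for r in records if r["date"].year in covid_years]
    return non_covid, covid


# Hypothetical daily totals, for illustration only.
records = [
    {"date": date(2019, 7, 4), "rentals": 12000},
    {"date": date(2020, 4, 15), "rentals": 2500},
    {"date": date(2021, 1, 10), "rentals": 3100},
    {"date": date(2022, 9, 1), "rentals": 11000},
]
non_covid, covid = split_by_covid(records)
```

Evaluating a model trained on `non_covid` against the `covid` subset gives a direct measure of how much accuracy is lost under conditions the training data did not cover.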
Capital Bikeshare. (n.d.). System Data. https://www.capitalbikeshare.com/system-data
Visual Crossing. (n.d.). Weather Data Services. https://www.visualcrossing.com/weather/weather-data-services
Time and Date AS. (n.d.). Holidays and Observances in United States in [year]. https://www.timeanddate.com/holidays/us/[year]/
dplyr: Hadley Wickham, Romain Francois, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.7. https://CRAN.R-project.org/package=dplyr
ezids: Scott Chamberlain (2017). ezid: Easy Handling of Online Data Archiving via the EZID System. R package version 0.3.0. https://CRAN.R-project.org/package=ezid
ggmap: David Kahle and Hadley Wickham (2013). ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
ggplot2: Hadley Wickham (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org/
tidyverse: Hadley Wickham, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, Alex Hayes, Lionel Henry, Jim Hester, Max Kuhn, Thomas Lin Pedersen, Evan Miller, Stephan Milton Bache, Kirill Müller, Jeroen Ooms, David Robinson, Dana Paige Seidel, Vitalie Spinu, Kohske Takahashi, Davis Vaughan, Claus Wilke, Kara Woo, and Hiroaki Yutani (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
highcharter: Joshua Kunst (2020). highcharter: A Wrapper for the ‘Highcharts’ Library. R package version 0.9.999. https://CRAN.R-project.org/package=highcharter
corrplot: Taiyun Wei and Viliam Simko (2017). R Package “corrplot”: Visualization of a Correlation Matrix. (Version 0.84). https://cran.r-project.org/web/packages/corrplot/corrplot.pdf
knitr: Yihui Xie (2021). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.36. https://yihui.org/knitr/
kableExtra: Hao Zhu (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra
tidyr: Hadley Wickham and Lionel Henry (2020). tidyr: Tidy Messy Data. R package version 1.1.4. https://CRAN.R-project.org/package=tidyr
lubridate: Garrett Grolemund and Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. https://doi.org/10.18637/jss.v040.i03
shiny: Winston Chang, Joe Cheng, JJ Allaire, Yihui Xie and Jonathan McPherson (2021). shiny: Web Application Framework for R. R package version 1.6.0. https://CRAN.R-project.org/package=shiny
igraph: Gábor Csárdi, Tamás
GitHub: https://github.com/mohiddin7/Final_project_DATS6101_SIM